Information Retrieval in Life Sciences: The LAILAPS Search Engine
Abstract
Retrieval and citation of primary data is an important factor in the approaching e-science age. Building a flexible but homogeneous information retrieval infrastructure to access and query the world's life science databases is crucial for an efficient bioinformatics infrastructure. In this contribution, we demonstrate the use of nine features, determined per database entry, in combination with neural networks as relevance approximators, a novel approach to increasing the quality of information retrieval in the life sciences. The implementation of this concept is the LAILAPS search portal. It was designed to support scientists in extracting relevant records from a set of millions of entries that come from private or public databases. Because data relevance is highly subjective, we support user-specific training of several relevance-predicting neural networks. To keep the neural networks effective, they are trained continuously in the background. For this, the system uses user feedback, gathered either by drawing conclusions from the user's interaction with the query result browser or from manual ratings of data quality. Through an intuitive web frontend, the user may search over millions of integrated life science data records. The web frontend comprises a keyword-based query system supporting auto-completion, spelling suggestions and synonyms; a browser for relevance-ordered query results; a data browser to inspect and rate matching data records; and a recommender system to suggest closely related records. The system is available at http://lailaps.ipk-gatersleben.de

1 Information Retrieval in Life Science

“Getting information is not much of a challenge.
Just head for Google, PubMed [Lu11] or Entrez [SEOK96] and get the related web page or database entry.” This is the answer one frequently gets from biologists when asked which methods or systems they prefer for finding relevant data for a particular biological question [DHW08]. However, getting reliable and relevant information, e.g. on the function of a protein or on the proteins involved in the cancer cell cycle, is a much more challenging task. The user has the choice of about 1,200 life science databases [Gal12] with billions of database records.

Intuitively, the first choice for information acquisition is a web search engine. Web site ranking techniques order query hits by their relevance. However, applying ranking methods that were developed for natural language text or WWW sites to life science content and databases is questionable [RPB06]. For example, the top-ranked Google hit for “arginase” is a Wikipedia page. This is because the page is referenced by a large number of web pages, or because Google assigned it a manually defined priority rank. The underlying hypothesis is: a high hyperlink in-degree of a page means high popularity, and high popularity means high relevance.

To find scientifically relevant database entries, scientists need strong scientific evidence related to their specific research field. A dentist has other relevance criteria than a plant biologist or a patent agent. The intuitive and commonly used approach at the scientist's desktop is query refinement. Criteria such as who published, in which journal, for which organism, evidence scores, and surrounding keywords are of major importance. Even complete search guides have been published, e.g. for dentists [Day01].

Other ranking algorithms use Term Frequency Inverse Document Frequency (TF-IDF) as the ranking criterion.
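To make the TF-IDF criterion concrete, the following is a minimal sketch of the textbook formulation, not the scoring formula of any particular library. The function name `tf_idf_scores`, the whitespace tokenisation, and the example documents are assumptions made for illustration only.

```python
import math
from collections import Counter


def tf_idf_scores(query, docs):
    """Score each document against the query with a plain TF-IDF sum.

    tf  = term count in the document / document length
    idf = log(N / df), where df is the number of documents containing the term
    Tokenisation is a naive lowercase whitespace split (an assumption).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if counts[term]:
                tf = counts[term] / len(tokens)
                idf = math.log(n_docs / df[term])
                score += tf * idf
        scores.append(score)
    return scores


# Hypothetical database entries, for illustration only.
entries = [
    "arginase catalyses the hydrolysis of arginine",
    "kinase activity in the cell cycle",
    "arginase arginase deficiency",
]
ranking = sorted(zip(tf_idf_scores("arginase", entries), entries), reverse=True)
```

Note that the score depends only on term statistics: an entry that repeats the query term in a short text ranks highest, regardless of which database field matched or whether the entry fits the user's research field.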
Apache Lucene¹ is a popular TF-IDF implementation and is frequently used in bioinformatics, for example in LuceGene from the GMOD project [ODC08], which is used for the EBI search frontend EBeye. The TF-IDF approach works well, but misses the semantic context between the database entries and the query. Andrade and Silva consider the similarity between the result entry and the search query itself as a top-ranking criterion [AS06], while Greifeneder [Gre10] proposes several possible relevance criteria, including the absolute or relative frequencies of the keyword(s) of the search query, and the scope or the topicality of the web page constituting the query result. Another approach is probabilistic relevancy ranking [ILDF07], whereby probabilistic values for the relevance of database fields and word combinations have to be predefined. In combination with a user feedback system, the probabilistic approach shows promising ranking performance [ABD06]. Semantic search engines use methods from natural language processing and dictionaries to predict the semantically most similar database entries. Such conceptual search strategies, implemented in GoPubMed [DS05] or ProMiner [HFM05], are frequently used in text mining projects.

2 The LAILAPS Search Engine

In this contribution we apply the LAILAPS search engine [LSB10] as a system that combines a range of well-discriminating database relevance features within a probabilistic model, taking user-specific relevance profiles into account. The concept of the LAILAPS information retrieval portal is to provide an information retrieval infrastructure that meets the requirements of the e-science age and to offer an information retrieval platform for data research and exploration. To this end, we built a search engine with the aim of finding relevant data in non-integrated life science databases.
¹ http://lucene.apache.org

2.1 The LAILAPS Relevance Prediction

The strategy is to keep only as much of the data structure of the imported databases as is necessary to support relevance ranking. We do not integrate databases at the model, schema or data level. Instead, LAILAPS stores the loaded life science databases in an entity-attribute-value (EAV) adapted database schema. This flexible concept enables the import of RFC-compatible CSV-formatted exports from life science databases, where each row represents a database record and each column a field. For the database import, an interactive user interface is provided. For the imported databases, an inverted text index is computed using Apache Lucene. Furthermore, the user may provide synonyms and relevance-influencing keywords. In the publicly available installations, we provide more than 1 million synonyms extracted from the NCBI Entrez system.

The system was designed to provide the look and feel of a web search engine. To support a platform-independent implementation and a scalable service, we decided to use a Java 3-tier web application. The frontend is a web application that supports a keyword-based search, a browser for relevance-ordered query results, and a data browser to inspect and rate matching data records (Figure 2). The feedback system enables the user to train the relevance prediction system with individual relevance ratings.

Figure 1: The LAILAPS relevance prediction workflow

The core of LAILAPS is a probabilistic model for relevance prediction on the basis of neural networks. Because data relevance is highly subjective to the user of an information retrieval system, we support user-specific neural networks. Motivated by the observation of user behavior during search engine result inspection, we introduced a set of 9 features. They are well discriminating and efficiently quantifiable, which allows a reasonably fast implementation:

1. attribute in which the query term was found
2. database of the entry
3. frequency of all query terms in the entry and attribute
4. co-occurrence, distances and order of the query terms in the entry
5. good or bad keywords near the query terms
6. the organism to which the entry relates
7. size of the data section in the entry
8. proportion of the attribute that is matched by the query term
9. whether a synonym expansion was necessary to get the hit

2.2 The Search Engine Software

In order to meet current standards for web information systems and to provide a scalable implementation that supports hundreds of parallel user sessions, LAILAPS is implemented as a 3-tier system consisting of a frontend, business logic and a database backend. The frontend is a J2EE web application. Its core features are:

1. an ad-hoc keyword-based query system, supporting auto-completion, suggestion and result size estimation;
2. a data browser and feedback system to inspect and rate matching data records.

The business server is implemented as a Java RMI service and implements the required functions, such as query parsing, synonym expansion, query suggestion, text indexing, feature extraction, relevance prediction and relevance feedback collection. The backend manages the indexed life science databases as well as the text indexes and lookup tables. For this, we use a combination of a relational database (H2), a key-value database (BerkeleyDB) and an inverted index database (Apache Lucene). This enables LAILAPS to be hosted on a single low-cost server. On a 2-core Intel CPU at 2.4 GHz with a standard SATA HDD, LAILAPS answers broad queries with millions of hits (e.g. the keyword “gene”) in less than 10 seconds. More selective queries take only some milliseconds.

In fact, users rarely invest time to rate database entries. Rather, they inform the search engine indirectly about the relevance of a visited database entry through their behaviour. The obvious reaction to an uninteresting entry is to close the page.
These and other so-called implicit ratings are used by the LAILAPS system:

• clicked result entries
• clicked entries above, below
• activity time
• scroll amount
• mouse movement
• page lost / got focus

Figure 2: The LAILAPS web frontend. Queries are submitted as keywords, which are expanded interactively by the query suggestion system. The matching database records are listed according to their relevance. Each record links to its original database and can be inspected embedded in the LAILAPS data browser and feedback system. Here the user may rate the quality and explore related database entries.
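As an illustration of how implicit ratings could drive the continuous training of a relevance-predicting network over the nine features, here is a minimal sketch in plain Python. It is not the LAILAPS implementation: the `Interaction` fields, the label weights, the 30-second dwell cap, the network size, and the numeric encoding of the nine features are all assumptions chosen for readability.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Interaction:
    """Implicit signals for one visited result entry (hypothetical schema)."""
    clicked: bool
    activity_seconds: float
    scroll_fraction: float  # share of the entry page scrolled, 0.0 to 1.0
    lost_focus: bool


def implicit_label(ev):
    """Map implicit signals to a weak relevance label in [0, 1].

    The weights and the 30-second cap are illustrative assumptions.
    """
    if not ev.clicked:
        return 0.0
    dwell = min(ev.activity_seconds / 30.0, 1.0)
    score = 0.6 * dwell + 0.4 * ev.scroll_fraction
    return score * 0.8 if ev.lost_focus else score


class TinyNet:
    """One hidden layer: 9 features -> tanh hidden units -> sigmoid relevance."""

    def __init__(self, n_in=9, n_hidden=4, seed=0):
        rnd = random.Random(seed)
        self.w1 = [[rnd.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rnd.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def _forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        z = sum(w * hi for w, hi in zip(self.w2, h)) + self.b2
        return h, 1.0 / (1.0 + math.exp(-z))

    def predict(self, x):
        return self._forward(x)[1]

    def update(self, x, target, lr=0.1):
        """One stochastic gradient step on the cross-entropy loss."""
        h, y = self._forward(x)
        dz = y - target  # gradient w.r.t. the pre-sigmoid output
        for j, hj in enumerate(h):
            dh = dz * self.w2[j] * (1.0 - hj * hj)  # back through tanh
            self.w2[j] -= lr * dz * hj
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * dh * xi
            self.b1[j] -= lr * dh
        self.b2 -= lr * dz


# Hypothetical feature vectors: nine values scaled to [0, 1], one per feature.
visited_long = [1.0, 1.0, 0.8, 0.5, 1.0, 1.0, 0.3, 0.9, 0.0]
closed_fast = [0.0, 0.0, 0.1, 0.0, 0.0, 0.0, 0.9, 0.1, 1.0]

net = TinyNet()
for _ in range(300):  # stands in for continuous background training
    net.update(visited_long, implicit_label(Interaction(True, 60.0, 0.5, False)))
    net.update(closed_fast, implicit_label(Interaction(True, 1.0, 0.0, True)))
```

Keeping one such network per user or user group is one way to reflect that relevance is subjective; explicit manual ratings could be fed through the same `update` call in place of the implicit label.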